Transcriptome analysis and development of EST-SSR markers in Anoectochilus emeiensis (Orchidaceae)

Anoectochilus emeiensis K. Y. Lang, together with other Anoectochilus species, has long been used as the main source of many traditional Chinese medicines. Owing to the shortcomings of molecular markers, the study of the genetic diversity and medicinal component synthesis mechanism of the endemic Anoectochilus species has been delayed. In this study, I carried out a transcriptome analysis of A. emeiensis. A total of 78,381 unigenes were assembled from 64.2 million reads, and 47,541 (60.65%) unigenes were matched to known proteins in the public databases. Then, 9284 expressed sequence tag-derived simple sequence repeats (EST-SSRs) were identified, and the frequency of SSRs in the A. emeiensis transcriptome was 9.88%. Trinucleotide repeats (3699, 39.84%) were the most common type, followed by dinucleotide (3251, 35.02%) and mononucleotide (1750, 18.85%) repeats. Based on the SSR sequence, 6683 primer pairs were successfully designed, 40 primer pairs were randomly selected, and 10 primer pairs were identified as polymorphic loci from 186 individuals of A. emeiensis. The EST-SSR markers examined in this study will be informative for future population genetic studies of A. emeiensis.


Introduction
Anoectochilus emeiensis K. Y. Lang, an endemic ethnomedicinal plant, is distributed on Mount Omei, southwestern China. It is a traditional Chinese perennial herbal medicine, similar to other species of the genus Anoectochilus (Orchidaceae), including A. formosanus, A. roxburghii, and A. koshunensis. In Chinese folk, it was named "King Medicine" owing to its diverse pharmacological effects such as the remedy for diabetes, hypertension, fever, tumor, liver disease, snakebite, and lung disease [1]. Anoectochilus species are usually diploid, with 40 chromosomes [2]. The whole diploid genome DNA content has been found to be approximately 6.83 pg based on flow cytometry analysis [3]. Overexploitation and depletion have occurred with the constantly increasing demand for wild A. emeiensis and other Anoectochilus species for their medicinal properties and ornamental value. Currently, A. emeiensis is considered "critically endangered" by the China Species Red List because of its habitat degradation or loss, narrow distribution, few populations, and small population sizes [4].

PLOS ONE
PLOS ONE | https://doi.org/10.1371/journal.pone.0278551 December 6, 2022 1 / 13 a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 The revelation of population structure, local adaptation, and differences in diversity between populations is vital for understanding the genetic variation within a species [5]. Previous studies on A. emeiensis and other Anoectochilus species have mainly focused on the evaluation of germplasm resources, seed breeding, artificial cultivation techniques, chemical composition, and pharmacological effects [1]. However, molecular research and wild population protection of A. emeiensis have fallen behind.
As the most popular markers in population genetic studies, simple sequence repeats have been increasingly used in fingerprinting, parentage analysis, genetic mapping, and genetic structure analysis because of their codominant and highly polymorphic nature in recent years [6][7][8]. In the past two decades, next-generation sequencing technology has been developed so quickly that not only model but also non-model organisms without sequenced genomes have been successfully sequenced for their high-throughput, accurate, and cost-effective advantages in transcriptome analysis [9][10][11][12][13]. Transcriptome sequencing is an efficient process for producing expressed sequence tags that are valuable for the development of molecular markers, gene annotation and discovery, and expression profiling [9,11,14]. Because of the lengthy and costly development phase and relatively low throughput, the development of single sequence repeats (SSRs) based on genome sequencing has not been as frequently adopted as transcriptome sequencing. Expressed sequence tag-derived SSR (EST-SSR) markers have been successfully developed by de novo transcriptome sequencing and assembly in many crops, such as barley, rice, and wheat [15][16][17], and in some Chinese herbal medicines, such as Dendrobium officinale, Dysosma versipellis, Panax ginseng, Panax vietnamensis, and Siberian ginseng [18][19][20][21]. To better protect wild germplasm resources, it is imperative to sequence the transcriptome of A. emeiensis. In the present study, the transcriptome of A. emeiensis leaves was sequenced using the Illumina HiSeq 2000 platform, and several EST-SSRs were derived. To our knowledge, this is the first systematic report of the complete transcriptome of A. emeiensis. The aims of this study were to (1) characterize and analyze the transcriptome of A. emeiensis, (2) determine the frequency and distribution of SSRs, (3) establish EST-SSR markers and examine the level of polymorphism, and (4) assess the genetic diversity of A. emeiensis among six populations.

Plant materials, DNA, and RNA isolation
A total of 186 samples of A. emeiensis were collected from six natural populations in Mount Omei, Sichuan Province, China, at Longdongcun (LDC), Longdonghu (LDH), Daping (DP), Wuhe (WH), Jinzhulin (JZL), and Chuanzhusi (CZS) (August 8, 2015). Field research was approved by the Emeishan Ecological Environment Bureau, Sichuan Province, China. The collected fresh leaves of A. emeiensis had a height of 8-10 cm. After freezing in liquid nitrogen, all leaves were stored at −80˚C until DNA and total RNA extraction. Genomic DNA was isolated using DNA Plantzol (QIAGEN, Dusseldorf, Germany), following the manufacturer's protocol. Voucher specimens were maintained in the herbarium of the Sichuan Institute of Natural Resources. Total RNA from ten different individuals was extracted using TRIzol reagent (Invitrogen Life Technologies, USA), following the manufacturer's instructions. The quality and concentration of RNA were determined using a NanoDrop ND-1000 spectrophotometer (Thermo Scientific, Wilmington, DE, USA) and an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). Equal proportions of high-quality RNA were pooled from each individual to construct a cDNA library.

Transcriptome sequencing, assembly, and annotation
Following the methods described by Liu et al. [10,14,22], a normalized cDNA library of A. emeiensis leaves was constructed. Briefly, the total RNA was collected and the poly(A) mRNA was purified with magnetic oligo(dT) beads. Then the mRNA was cleaved to short fragments by an RNA Fragmentation Kit (Ambion, Austin, TX), which were then used as templates to reverse-transcribe first-strand cDNA using random hexamer primers and reverse transcriptase (Invitrogen, Carlsbad, CA). Second-strand cDNA was synthesized in a reaction containing buffer, dNTPs, RNase H (Invitrogen) and DNA polymerase I (Qingkeweiye Biological Technology, Chengdu). The library was paired-end synthesized according to Illumina Genomic Sample Preparation Kit manufacturer's instructions and sequenced using Illumina HiSeq 2000 (Illumina Inc., San Diego, CA, USA) at One Gene Company Limited, Hangzhou, China. Image analysis, base calling, and quality value calculations were performed using the Illumina GA Pipeline version 1.6. The raw Illumina sequencing data were then submitted to the NCBI Sequence Read Archive (accession number: SRR8138703) (SRA, http://www.ncbi.nlm.nih. gov/Traces/sra). Clean reads were obtained by removing those of low quality (more than 5% ambiguous bases "N") and reads with more than 20% low-quality bases (value of �10) from raw reads using FastQC Version 0.11.5 (https://www.bioinformatics.babraham.ac.uk/projects/ fastqc/). De novo assembly was performed using Trinity [12]. Generally, clean reads were combined to form longer fragments without Ns, termed contigs. The remaining isolated reads are described as singletons. Individual reads were then arranged back to the contigs using pairedend mapping. Subsequently, the order and distance between these contigs were determined from the same transcript. Finally, Trinity connected contigs and obtained sequences that could not be extended at either end, which were named as unigenes. These unigenes were further processed by sequence splicing and redundancy removal using TGICL version 2.1 to obtain nonredundant unigenes [23]. To intensively study the putative functions, the unigenes were compared using the BLASTX search (http://blast.ncbi.nlm.nih.gov/Blast.cgi) with a cutoff E value of 10 −5 against protein databases such as NCBI nonredundant protein sequences (Nr), Swiss-Prot (http://www.expasy.ch/sprot/), the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database (http://www.genome.jp/kegg), and the Clusters of Orthologous Groups (COG) database (http://www.ncbi.nlm.nih.gov/COG). To further annotate the unigenes, gene ontology (GO) was performed using the Blast2GO program (B2G; http://www. blast2go.com) [24].

Identification of EST-SSRs and primers design
SSR loci were detected using MIcroSAtellite software (MISA) (http://pgrc.ipkgatersleben.de/ misa) [16]. The minimum number of repeats was six for dinucleotides, five for tri-and tetranucleotides, four for penta-and hexanucleotides, and twelve for mononucleotides. PRIMER version 3.0 [25] was used to design the primer pairs with the following settings: 18-28 bases (optimum 23 bp), product size of 80-160 bp, annealing temperature between 55˚C and 65˚C (optimum 60˚C), and CG content from 40% to 60% (optimum 50%). The potential primer pairs, hairpin structure, and occurrence of mismatches were checked using OLIGO (version 6.67; Molecular Biology Insights, Inc., Cascade, CO, USA). Finally, the specificity of the primer pairs was checked using BLAST against EST sequences.

EST-SSR analysis
The EST-SSR markers were initially tested for amplification using DNA from 24 A. emeiensis individuals to optimize the annealing temperature. In total, 40 SSRs were selected, and M13 (universal sequence-TGTAAAACGACGGCCAGT) [26] was added to the 5' end of each forward primer of pairs and synthesized by Qingkeweiye Biological Technology (Chengdu) Co., Ltd. Polymerase chain reaction (PCR) amplification was performed using a thermal cycler Gen-eAmp PCR System 9700 (Applied Biosystems, Foster City, USA) in a 10 μl reaction mixture containing 1 μl of genomic DNA (50 ng), 0.25 unit Taq DNA polymerase (TaKaRa, Dalian, Liaoning, China), 1 × PCR buffer, 1 μl of 2.5 mM MgCl 2 , 1 μl of 2.5 mM dNTPs, 0.1 μl bovine serum albumin (TaKaRa), 0.5 μl of each 10 μM primer, and 0.5 μl of 10 μM fluorescent labeled M13 primer. The PCR amplification conditions were as follows: hot start at 95˚C for 5 min, followed by 35 cycles of denaturation at 94˚C for 45 s, annealing at various primer-specific temperatures for 45 s, extension at 72˚C for 1 min, and a final extension at 72˚C for 10 min. The PCR products were analyzed using a LI-COR 4300 DNA Analyzer (Gene Company Limited, USA), and their sizes were scored and compiled using the SAGA Generation 2 software (LI-COR).

Selective neutrality tests and population genetic analysis
The number of observed alleles (N A ), observed (H O ) and expected (H E ) heterozygosity, and polymorphism information content (PIC) in each locus were calculated to test the polymorphism level using CERVUS version 3.0.3 [27]. For each population and locus of A. emeiensis, the diversity and inbreeding parameters, such as N A , H E , allelic richness (R S , standardized for the smallest number of individuals per unit using rarefaction), and the average inbreeding coefficient across all loci (F IS ) were estimated using FSTAT version 2.9.3 [28].

1. Assembly of A. emeiensis transcriptome data, functional annotation, and categorization of unigenes
In total, 10.7 Gb, 71,389,320 raw reads of 90 bp in length and 64,177,276 clean reads (89.90%) with 97.72% Q20 bases (base quality greater than 20) were generated. The total length of the reads was approximately 9.6 Gb ( Table 1). The percentage of ambiguous bases and GC percentage for A. emeiensis were 0 and 47.70%, respectively. After TRINITY assembly, high-quality reads were assembled into 115,283 contigs with a mean length of 852 bp and an N50 length of 1375 bp. These contigs were then stitched into 78,381 unigenes with a mean length of 1002 bp and an N50 length of 1602 bp. Of the 78,381 unigenes with a total length of approximately 78.5 Mb, 33,002 unigenes (42.10%) had a length range of 200-500 bp, 35,462 unigenes . Regarding the similarity distribution, 86.5% of the Nr assembled sequences showed more than 40% similarity to known sequences, and 13.9% of the annotated sequences showed more than 80% similarity. For the species distribution, 16.4% (7395) annotated sequences were hit on Vitis vinifera, while 3.5-6.7% orthologous genes were distributed in Amborella trichopoda, Amygdalus persica, Chaetochloa italica, Zea mays, Japanese rice, and Theobroma cacao.

Polymorphism of EST-SSR markers and population genetic diversity
A total of 6683 primer pairs were successfully designed, with amplicons ranging from 80 to 240. Subsequently, 40 primer pairs were randomly selected. In the initial screening of 24  (Table 2).
In each population, the mean number of alleles (N A ) across the 10 surveyed SSR markers varied from 2.6 to 4.8, and the LDC population was the highest. The average allelic richness (R S ) values ranged from 2.6 to 3.153, and the LDC population was also the highest. The mean expected heterozygosity (H E ) was 0.0211-0.7134. The mean F IS fluctuated from −0.058 to 0.332, and the value in the LDH population was the highest (Table 3).

Characterization of the A. emeiensis transcriptome
Previously, no ESTs of A. emeiensis were available in public databases. In this study, 78,381 unigenes were obtained from A. emeiensis leaves with a mean length of 1002 bp, which were longer than those of previous transcriptome studies of A. formosanus (mean length 679 bp), Dendrobium officinale (mean length 742 bp), Panax ginseng (mean length 469 bp), and Siberian ginseng (mean length 785 bp), but lower than those of Panax vietnamensis (mean length 1304 bp) [18][19][20][21]29]. The number of unigenes assembled in A. emeiensis was lower than that reported for A. formosanus (173,513), Panax ginseng (178,145), and Panax vietnamensis (126,758), but higher than that in Dendrobium officinale (36,407) and Dysosma versipellis (44,855) [7]. Approximately 60.65% (47,541) unigenes were successfully annotated in public databases (Nr, Nt, Swiss-Prot, COG, GO, and KEGG). Compared with previous studies, the number of annotated unigenes in A. emeiensis was greater than that in Dendrobium officinale

Locus
Longdongcun (LDC, N = 83) Daping (DP, N = 45) Longdonghu (LDH, N = 27)  together with other species of Anoectochilus, have been simultaneously studied for novel gene discovery and exploration of their medicinal mechanism. Furthermore, 4.47% unigenes were annotated to the plant-pathogen interaction pathway, most of which belonged to environmental adaptation groups. Anoectochilus plants are symbiotic with fungi and gain nutrition from the environment. These genes will help reveal how tiny exalbuminous seeds can survive under extreme conditions.

Frequency and distribution of EST-SSRs
A total of 9284 EST-SSRs were mined and distributed among 7742 unigenes. The SSR frequency in the A. emeiensis transcriptome (9.88%) was higher than that in maize (1.5%) and rice (4.7%) but lower than that in A. formosanus (12.21%) and Neolitsea sericea (16.25%), and it was the same as that in many dicot species that fall in this range of values (2.65-16.82%) [7,29,33]. The variable results may be due to different SSR mining software and criteria [33].
In A. emeiensis, the SSR motif types were not evenly distributed. Trinucleotide repeats (39.84%) were the most frequent type, followed by dinucleotides (35.02%) and mononucleotides (18.85%). The hexa-, penta-, and tetranucleotide repeat types were much rarer (2.47, 2.04, and 1.79%, respectively). The present result was congruent with earlier suggestions that trinucleotide repeats are the main EST-SSR type in both mono-and dicots [7,[34][35][36][37]. This is in contrast to many conclusions that dinucleotide repeats are the dominant repeats [33]. Also, this result is different from A. formosanus, in which mononucleotides have been determined to be the most common type (61.93%) [31]. The predominant trimeric SSRs may be attributed to the suppression of non-trimeric SSRs in coding regions [32]. In A. emeiensis, AAG/CTT (13.50%) and AG/CT (23.37%) motifs were the most common tri-and dinucleotide repeats, and CCG/CGG (5.30%) was also frequent. These results are congruent with those of previous studies on Arabidopsis, soybean, cucumber, sesame, cocoa, and caster bean, which also suggested that the trinucleotide AAG motif may be prominent in dicots [33,38]

Polymorphism of EST-SSR markers and population genetic diversity and structure
In total, 53 alleles were detected across 186 A. emeiensis individuals from six populations with 10 polymorphic EST-SSR markers ( Table 2). The N A values across these markers ranged from 3 to 8 with a mean of 5.3, which were higher than those of Dysosma versipellis and D. pleiantha, but lower than that of Neolitsea sericea [7,14]. The average PIC value was 0.514 for the ten developed loci. The PIC is used to assess the genetic information level. The results of this study showed that the PIC value fell at a medium level and showed high polymorphism.
After years of investigation, I found that there were two reproductive modes in the wild A. emeiensis community: asexual reproduction and insect pollination. Perennial rhizomes with induced A. emeiensis buds can grow to more than 20 cm. The new shoots grow completely from the succulent stem nodes. Although wild populations lack pollinating insects, A. emeiensis can occasionally be pollinated by insects. These factors contribute to the unique genetic structure of A. emeiensis.
Typical habitat fragmentation occurs in wild populations of A. emeiensis. Owing to the construction of roads, parking lots, and hotels, the distribution area has been isolated into several groups. Wild A. emeiensis was concentrated in the LDC, DP, LDH, and JZL populations. Allelic richness (Rs) was positively correlated with population size (Table 3). Almost all of the detected alleles (N A ) declined in the small populations (Table 3). For example, in the LDC population, the N A of U1017 was 8 but declined to 4 in CZS and 3 in WH. Thus, four and five alleles of U1017 were lost in the CZS and WH populations, respectively. The lowest allele frequency of U1017 was 174. However, this allele was lost in the JZL, CZS, and WH populations (S1 Table). The results showed that when the population becomes smaller owing to habitat fragmentation, the number of alleles in the remaining population mainly depends on the size of the population and the allele frequency distribution [39,40]. Alleles with lower frequencies will be lost.
In each population, inbreeding was prominent among the individuals (Table 3): 83.33% of the mean inbreeding coefficient (F IS ) was positive, providing evidence for this. Inbreeding increases homozygosity and reduces heterozygosity [41]. In the continuous inbreeding population, the frequency of heterozygotes tended to be zero. This was proven by the fact that H E of many alleles in four larger populations (LDC, DP, LDH, and JZL) was 0. Lacy and Lindenmayer (1995) found that when the population is divided into different small parts, heterozygosity and allelic diversity decrease rapidly [42]. The results of this study support this conclusion.
Overall, the EST-SSRs derived from this study are of effective quality in A. emeiensis. These markers will aid in investigating the genetic structure of wild populations, and the generated database will help us understand the molecular synthesis mechanisms of medicinal components in A. emeiensis.

Conclusions
In this study, I performed the first transcriptome analysis of A. emeiensis leaves using the Illumina HiSeq 2000 sequencing system. A total of 78,381 unigenes were assembled from 64.2 million reads, and 47,541 (60.65%) unigenes were matched to known proteins in public databases. Then, 9284 EST-SSRs were identified, and the frequency of SSRs in the A. emeiensis transcriptome was 9.88%. Trinucleotide repeats (3699, 39.84%) were the most common type, followed by dinucleotide (3251, 35.02%) and mononucleotide (1750, 18.85%) repeats. Based on the SSR sequence, 6683 primer pairs were successfully designed, 40 primer pairs were randomly selected, and 10 primer pairs were identified as polymorphic loci from 186 individuals of A. emeiensis. The EST-SSR markers examined in this study will be informative for future population genetic studies of A. emeiensis.
Supporting information S1 Table. Raw allele scoring table based on the